This repository was archived by the owner on Sep 24, 2025. It is now read-only.
Create global mlflow run and use it for checkpoints #144
Open
irenedea wants to merge 3 commits into single-controller-hackathon from irene/checkpoints-mlflow
3e87ee0
to
b46fc36
Compare
rithwik-db reviewed Aug 10, 2025
6d38c89
to
8880bc0
Compare
works comment rebased and updated supporting absolute paths and dbfs adding artifacts moved to mlflow utils file minor fix slight adjustment
8880bc0
to
cb5fc0b
Compare
rithwik-db reviewed Aug 11, 2025
Comment on lines +86 to +98
# NOTE: This doesn't work yet for a few reasons:
# 1. Downloading nested mlflow artifacts doesn't work correctly due to the MlflowObjectStore
# having issues. For instance, https://github.com/mosaicml/composer/blob/4ae29b1afec56ce2d54f6fa07a7f9578a0d364b0/composer/utils/object_store/mlflow_object_store.py#L465-L476
# requires `tmp_path = os.path.join(tmp_dir, os.path.basename(artifact_path))` instead of what it currently
# does. By doing that, the symlink can be loaded correctly.
# 2. If save_folder is an absolute path (e.g. /tmp/checkpoints), the symlink will be created using this
# absolute path. This is not a valid symlink in mlflow so we need to do some os.path gymnastics to
# support absolute paths for save_folder.
# 3. We also need to support save_folder being a dbfs path eventually.
# Proposed Approach
# - Create an MlflowCheckpointActor (allowing us to set WORLD_SIZE=1)
# and create functions within it, based on MlflowObjectStore,
# that safely handle dbfs paths and absolute paths.
This PR is NOT ready for review, since we are doing a lot of os.path gymnastics to support saving things to mlflow artifacts. I am going to keep this PR on hold for now until we have time to think of a more resilient solution that addresses the problems here. (cc: @irenedea @bowenyang008)
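For illustration only, here is a minimal sketch of the kind of path handling the code comment above describes. It is not code from this PR or from composer: the helper names `to_artifact_path` and `local_tmp_path`, the `DBFS_PREFIXES` tuple, and the choice to map an absolute `save_folder` to a relative artifact path by dropping the leading slash are all assumptions. Only the `os.path.join(tmp_dir, os.path.basename(artifact_path))` expression comes from the comment itself.

```python
import os
import posixpath

# Assumption: these are the dbfs spellings worth stripping; the PR does not list them.
DBFS_PREFIXES = ("dbfs:/", "/dbfs/")


def to_artifact_path(save_folder: str, filename: str) -> str:
    """Hypothetical helper: build a relative artifact path from save_folder.

    Relative, absolute, and dbfs-style folders all map to a relative,
    forward-slash path, so a symlink never embeds an absolute path.
    """
    folder = save_folder
    for prefix in DBFS_PREFIXES:
        if folder.startswith(prefix):
            folder = folder[len(prefix):]
            break
    # Collapse "a//b", "a/./b", and trailing slashes, then drop any leading "/".
    folder = posixpath.normpath(folder).lstrip("/")
    if folder == ".." or folder.startswith("../"):
        raise ValueError(f"save_folder escapes the artifact root: {save_folder!r}")
    if folder in ("", "."):
        return filename
    return posixpath.join(folder, filename)


def local_tmp_path(tmp_dir: str, artifact_path: str) -> str:
    """Temp download location, using the basename as suggested in the comment."""
    return os.path.join(tmp_dir, os.path.basename(artifact_path))
```

With these assumptions, `to_artifact_path("/tmp/checkpoints", "latest.symlink")` gives `tmp/checkpoints/latest.symlink`. Whether dropping the leading slash is the right mapping for mlflow artifacts is exactly the open question in this thread, so treat this as a starting point for discussion rather than a fix.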
After this PR, we should load the experience buffer from the checkpoints so that checkpoints work correctly with async. (Shouldn't be too hard.) It only works for sync right now.
https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/723944411900647/runs/fcbceb3f3c9142539744a0883575ab0a/system-metrics?o=7395834863327820
You can see the metrics and system metrics for two iterations, where the second was a resumption. This was a super small dummy run, so the loss values don't seem to show up when they repeat at 0.0... 🤷♀️